Investigate Titanic's Data

Questions to ask ourselves

What factors made people more likely to survive?

Sex
Class
Age
How much they paid



In [3]:

    
#imports
import pandas as pd
import numpy as np



In [4]:

    
raw_data = pd.read_csv('titanic_data.csv')



In [26]:

    
raw_data.head()









    Out[26]:






  
    
      
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500   NaN    S

Data Wrangling

We need to find out how many null values our data contains.

The describe function might be useful, since its count row shows how many non-null values each column has.



In [9]:

    
raw_data.describe()









    Out[9]:






  
    
      
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

However, describe only summarises the numeric columns by default, so this way we cannot see missing values in the non-numeric columns.

We move to another option:



In [12]:

    
raw_data.isnull().sum()









    Out[12]:





PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
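
As a side note, describe can also cover non-numeric columns when it is called with include='all'; its count row then holds the non-null count of every column. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from titanic_data.csv):

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column, both containing NaN.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# include="all" makes describe report text columns as well as numeric ones.
summary = toy.describe(include="all")

# The "count" row is the number of non-null values per column.
non_null = summary.loc["count"]
missing = len(toy) - non_null
print(missing)
```

isnull().sum() remains the more direct route, which is why the notebook uses it.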

How do we treat nulls

In AGE

Out of 891 rows, 177 are NaN, which is roughly 20%. If we replace these NaN with another value, it should be a guard value that cannot be mistaken for a real age, so it does not distort the rest of the values.

In Cabin

Out of 891 rows, 687 are nulls, representing an astounding 77%. Ignoring this column altogether makes more sense.

In Embarked

With only 2 NaN in this column, we could simply ignore these rows. We could also assign them another value and see how they behave.
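
The percentages quoted above (177/891 and 687/891) can be computed directly instead of by hand. The sketch below is one possible way to do it; missing_share is a hypothetical helper name and the frame is made up for illustration:

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    # The mean of a boolean mask is the share of True values.
    return df.isnull().mean().sort_values(ascending=False)

# Tiny made-up frame, only to show the shape of the result.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})
shares = missing_share(toy)
print(shares)
```

Calling missing_share(raw_data) should give the same proportions as dividing the isnull().sum() counts by the 891 rows.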

Code

Age



In [6]:

    
clean_data = raw_data.copy()
clean_data['Age'] = clean_data['Age'].fillna(-1)
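
One caveat with a guard value is that -1 still takes part in numeric summaries and shows up as its own bar in per-age plots. A minimal sketch of filtering it out before computing statistics, assuming the same -1 convention; AGE_UNKNOWN and the toy ages are illustrative names and values:

```python
import numpy as np
import pandas as pd

AGE_UNKNOWN = -1  # assumed guard value, same convention as fillna(-1)

# Made-up ages with two missing entries.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})
toy["Age"] = toy["Age"].fillna(AGE_UNKNOWN)

# Rows whose age is actually known.
known_age = toy[toy["Age"] != AGE_UNKNOWN]

mean_with_guard = toy["Age"].mean()      # pulled down by the -1 entries
mean_known = known_age["Age"].mean()     # mean of the real ages only
print(mean_with_guard, mean_known)
```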

Cabin



In [7]:

    
clean_data.drop('Cabin', axis=1, inplace=True)

Embarked

Before deleting anything, let's check the rows



In [27]:

    
raw_data[raw_data['Embarked'].isnull()]









    Out[27]:






  
    
      
     PassengerId  Survived  Pclass  Name                                       Sex     Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
61   62           1         1       Icard, Miss. Amelie                        female  38   0      0      113572  80    B28    NaN
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)  female  62   0      0      113572  80    B28    NaN

It looks a bit strange that both passengers survived, shared the same cabin and the same ticket, and both lack their Embarked information.

Instead of deleting them we will leave the rows for now.
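
One possible follow-up, sketched here on made-up rows, is to list every passenger who shares a ticket with a row whose Embarked is missing, in case a companion on the same ticket has the port recorded. The helper name is hypothetical:

```python
import numpy as np
import pandas as pd

def rows_sharing_ticket_with_missing_port(df):
    """Return all rows whose Ticket also appears on a row with no Embarked value."""
    tickets = df.loc[df["Embarked"].isnull(), "Ticket"].unique()
    return df[df["Ticket"].isin(tickets)]

# Made-up rows: passengers 2 and 3 share ticket "B2" and lack a port.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "B2", "B2", "C3"],
    "Embarked": ["S", np.nan, np.nan, "C"],
})
shared = rows_sharing_ticket_with_missing_port(toy)
print(shared)
```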

These are configuration options for the charts.



In [5]:

    
%pylab inline
figsize(47,20)









    



Populating the interactive namespace from numpy and matplotlib

Data Exploration

We want to see the data depicted in these ways:

How many people survived?
Survival by age
Survival by sex
Survival by age and sex
Survival by age and class
Survival by sex and class

This lets us see where the survival rates are concentrated.

How many people survived?

As a first exploration step, we are interested in how many people survived.



In [79]:

    
import matplotlib.pyplot as plt

# Number of passengers per outcome: index 0 = died, 1 = survived
survivors = clean_data.groupby('Survived').count()['Name']

plt.figure(figsize=(18,8))
colors = ['grey','cyan']
plt.pie(survivors, labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors=colors)

plt.axis("equal")
plt.title("Titanic Survivors")
plt.show()
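
The same answer is available as plain numbers. Because Survived is coded 0/1, its mean is the survival rate, which is why the describe output earlier shows a mean of 0.383838 (about 38.4%). A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up outcomes: 1 = survived, 0 = died.
toy = pd.DataFrame({"Survived": [0, 1, 1, 0, 0]})

# Absolute counts per outcome, ordered 0 then 1.
counts = toy["Survived"].value_counts().sort_index()

# Mean of a 0/1 column equals the share of ones.
rate = toy["Survived"].mean()
print(counts)
print(rate)
```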

Survival by Age

Code



In [41]:

    
clean_data[clean_data['Survived'] == 1].groupby('Age').count().reset_index().plot(kind='bar',y='PassengerId', x='Age')
#pd.pivot_table(clean_data[clean_data['Survived'] == 1], index='Age', aggfunc=np.count_nonzero









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x10ff86310>

But this is not very helpful, since we don't see how many people there were in each group. We can either plot both survivors and non-survivors, or calculate a survival ratio by age.

Let's see which helps us more.



In [84]:

    
#clean_data.groupby(['Age','Survived']).count().reset_index().plot(kind='bar',stacked = True, y='PassengerId', x='Age')
pivot_age = pd.pivot_table(clean_data, values='PassengerId', index='Age', columns='Survived', aggfunc=np.count_nonzero)
pivot_age.fillna(0).plot(kind='bar', stacked=True)









    Out[84]:





<matplotlib.axes._subplots.AxesSubplot at 0x1178e7f90>

From what we can see, not much information can be gained from age, but let's analyse by ratio, to be certain about that.



In [94]:

    
pivot_age = pivot_age.fillna(0)
pivot_age['survival_ratio'] = pivot_age[1] / (pivot_age[0] + pivot_age[1])
pivot_age.plot(kind = 'bar', y='survival_ratio')









    Out[94]:





<matplotlib.axes._subplots.AxesSubplot at 0x12274c4d0>

From this plot we can see that the highest survival ratios are for ages up to 9 and between 11 and 14. Some other age ranges also have good survival ratios, such as 47 to 55.
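
Per-year ratios are noisy because some ages have very few passengers. A possible refinement is to group ages into bands with pd.cut before computing the ratio. The sketch below uses made-up rows, and the band edges and labels are assumptions chosen for illustration, not something the data dictates:

```python
import pandas as pd

# Made-up rows; -1 marks an unknown age, as in clean_data.
toy = pd.DataFrame({
    "Age": [2, 5, 15, 25, 35, 45, 70, -1],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Drop the guard value so it does not land in any band.
known = toy[toy["Age"] >= 0]

# Assumed band edges; right=False makes each band [low, high).
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
age_band = pd.cut(known["Age"], bins=bins, labels=labels, right=False)

# Mean of the 0/1 Survived column per band is the survival ratio.
ratio_by_band = known.groupby(age_band, observed=False)["Survived"].mean()
print(ratio_by_band)
```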

Survival by Sex

Let's see which sex survived more.

Code



In [99]:

    
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2)
survivors_male = clean_data[clean_data['Sex']=='male'].groupby('Survived').count()['Name']
survivors_female = clean_data[clean_data['Sex']=='female'].groupby('Survived').count()['Name']

colors = ['grey','cyan']

male_plot = survivors_male.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[0])
male_plot.axis("equal")
male_plot.set_title("Male Titanic Survivors")

female_plot = survivors_female.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[1])
female_plot.axis("equal")
female_plot.set_title("Female Titanic Survivors")









    Out[99]:





<matplotlib.text.Text at 0x12b090510>






    












    





<matplotlib.figure.Figure at 0x1227e4210>

As this representation clearly shows, a large share of the women survived: around 74%.

With this information alone we could already make a pretty good prediction.
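
To put a number on that, the sketch below scores the simplest possible rule, "every woman survives, every man dies", against the died/survived counts from the class and sex pivot table later in this notebook. The column names died and survived are labels chosen here:

```python
import pandas as pd

# Counts copied from the Pclass/Sex pivot table in this notebook.
counts = pd.DataFrame(
    {"died": [3, 77, 6, 91, 72, 300], "survived": [91, 45, 70, 17, 72, 47]},
    index=pd.MultiIndex.from_tuples(
        [(1, "female"), (1, "male"), (2, "female"),
         (2, "male"), (3, "female"), (3, "male")],
        names=["Pclass", "Sex"],
    ),
)

# Collapse the class level to get totals per sex.
by_sex = counts.groupby(level="Sex").sum()

# The rule is right for women who survived and men who died.
correct = by_sex.loc["female", "survived"] + by_sex.loc["male", "died"]
accuracy = correct / counts.to_numpy().sum()
print(correct, accuracy)
```

On these counts the rule is right for 701 of the 891 passengers, roughly 79%.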

Survival by age and sex

A set of visualizations might help us see whether the highest survival ratios for males are concentrated in one particular age range. For females, the dead ratio by age is the interesting one, to see which ages fared worst.



In [103]:

    
survivors_male_age_pivot = clean_data[clean_data['Sex']=='male'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_male_age_pivot = survivors_male_age_pivot.fillna(0)['PassengerId']
survivors_male_age_pivot['survival_ratio'] = survivors_male_age_pivot[1]/(survivors_male_age_pivot[1]+survivors_male_age_pivot[0])
survivors_male_age_pivot.plot(kind='bar', y='survival_ratio')









    Out[103]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d2a3550>

With this representation we can clearly see that males aged 0 to 6 are the ones who survived the most.

For females we want to study which ages died the most, since many more women survived.



In [104]:

    
survivors_female_age_pivot = clean_data[clean_data['Sex']=='female'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_female_age_pivot = survivors_female_age_pivot.fillna(0)['PassengerId']
survivors_female_age_pivot['dead_ratio'] = survivors_female_age_pivot[0]/(survivors_female_age_pivot[1]+survivors_female_age_pivot[0])
survivors_female_age_pivot.plot(kind='bar', y='dead_ratio')









    Out[104]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c9ecb90>

We would have expected something clearer, but this doesn't help us. There is no conclusion that we can draw from this data.

Survival by age and class

First we need to explore the different values we have in class.



In [108]:

    
clean_data['Pclass'].head()









    Out[108]:





0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

We see the data takes values ranging from 1 to 3, standing for 1st class (richer) to 3rd class (poorer).



In [114]:

    
def get_survival_ratio_pivot(source, attribute, value):
    # Count passengers per age and outcome, keeping only rows where attribute == value
    pivot = source[source[attribute]==value].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
    # Keep the PassengerId counts: column 0 = died, column 1 = survived
    pivot = pivot.fillna(0)['PassengerId']
    pivot['survival_ratio'] = pivot[1]/(pivot[1]+pivot[0])
    return pivot



In [109]:

    
survivors_first_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 1)
survivors_first_age_pivot.plot(kind='bar', y='survival_ratio')



    Out[109]:



<matplotlib.axes._subplots.AxesSubplot at 0x10cf86890>
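
get_survival_ratio_pivot aggregates every column and then keeps only PassengerId. A possible variant, sketched below under the name survival_ratio_by_age (hypothetical, not part of the notebook), counts PassengerId only and guarantees that both outcome columns exist even when a subset has no survivors or no deaths. The toy rows are made up:

```python
import pandas as pd

def survival_ratio_by_age(source, attribute, value):
    """Per-age died/survived counts and survival ratio for one attribute value."""
    subset = source[source[attribute] == value]
    pivot = subset.pivot_table(
        values="PassengerId", index="Age", columns="Survived",
        aggfunc="count", fill_value=0,
    )
    # Make sure columns 0 (died) and 1 (survived) are both present.
    pivot = pivot.reindex(columns=[0, 1], fill_value=0)
    pivot["survival_ratio"] = pivot[1] / (pivot[0] + pivot[1])
    return pivot

# Made-up rows for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [2, 2, 2, 2, 3],
    "Age": [4, 4, 30, 30, 30],
    "Survived": [1, 1, 0, 1, 0],
})
second = survival_ratio_by_age(toy, "Pclass", 2)
print(second)
```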



In [115]:

    
survivors_second_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 2)
survivors_second_age_pivot.plot(kind='bar', y='survival_ratio')









    Out[115]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d5a8c90>

This distribution is more revealing. People from second class only got saved if they were extremely young. At this point it would be helpful to know how many people this represented.



In [126]:

    
survivors_second_age_pivot.columns = ['Died', 'Survived', 'Ratio']
# Stack the renamed count columns to show how many passengers each age represents
ssap_plot = survivors_second_age_pivot.plot(kind='bar', stacked=True, y=['Died', 'Survived'])



In [127]:

    
survivors_third_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 3)
survivors_third_age_pivot.plot(kind='bar', y='survival_ratio')









    Out[127]:





<matplotlib.axes._subplots.AxesSubplot at 0x13aabf490>

This distribution shows that just by being in 3rd class, your chances of surviving were a lot lower. Let's calculate how much lower.



In [130]:

    
survived_by_class = clean_data.pivot_table(index='Pclass', columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
survived_by_class['ratio'] = survived_by_class[1]/(survived_by_class[1]+survived_by_class[0])
survived_by_class

The trend is clear. Less money, less possibility of survival.
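
The per-class ratios can be cross-checked from the class and sex counts shown later in this notebook by summing over sex. The sketch below does that; the column names died and survived are labels chosen here:

```python
import pandas as pd

# Counts copied from the Pclass/Sex pivot table in this notebook.
counts = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male"] * 3,
    "died": [3, 77, 6, 91, 72, 300],
    "survived": [91, 45, 70, 17, 72, 47],
})

# Sum women and men within each class, then compute the survival ratio.
by_class = counts.groupby("Pclass")[["died", "survived"]].sum()
by_class["ratio"] = by_class["survived"] / (by_class["died"] + by_class["survived"])
print(by_class)
```

This gives roughly 0.63 for first class, 0.47 for second and 0.24 for third.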

Survival by sex and class

Let's build a pivot table that represents this information as clearly as possible.



In [11]:

    
from pivottablejs import pivot_ui
pivot_ui(clean_data)









    Out[11]:

With the help of this tool we find that the clearest layout is:



In [15]:

    
class_gender_pivot = pd.pivot_table(clean_data, index=['Pclass','Sex'],columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
class_gender_pivot['survival_ratio'] = class_gender_pivot[1]/(class_gender_pivot[1]+class_gender_pivot[0])
class_gender_pivot









    Out[15]:






  
    
      
Survived        0    1   survival_ratio
Pclass  Sex
1       female  3    91  0.968085
        male    77   45  0.368852
2       female  6    70  0.921053
        male    91   17  0.157407
3       female  72   72  0.500000
        male    300  47  0.135447

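From this information we can say that a higher class improved the chances of survival, especially for men: first-class men survived at 36.9%, more than double the 15.7% of second class and the 13.5% of third class. Women in first and second class almost all survived (96.8% and 92.1%), while women in third class had exactly a 50% chance of surviving.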
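
Since Survived is coded 0/1, the same kind of ratio table can also be obtained without a count pivot, by taking the group mean. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up passengers for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Mean of Survived per (class, sex) group, with sex spread across columns.
ratio = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(ratio)
```

The count-based pivot used above has the advantage of also showing how many passengers each ratio rests on.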

Conclusions

After analysing the data, we can state that:

Females were more likely to survive than males.
Upper classes had higher survival ratios. First class had the best survival ratio for men, while first and second class had the best survival ratios for women.
Age was a factor, but its effect is difficult to pinpoint precisely.
